Spotify has become the go-to music library for music lovers over the last decade. Its recommendation model, which internally uses artist popularity and content-based filtering on the genres a user likes, is in itself state of the art. This is an analysis of how different audio features relate to popularity, and of their correlation with genre and subgenre.
Problem Statement
Music taste is subjective but repetitive, meaning listeners tend to prefer certain music genres over others. Genre classification is a critical task for music platforms like Spotify, as it helps artists connect with new audiences. With millions of new songs released every day, automating this task is of paramount importance. However, this is only possible if genres have unique characteristics that differentiate them from one another. With this dataset we will explore the following:
Use cases for Spotify:
High Level Approach
We have dual objectives when evaluating the dataset: first, to understand the key characteristics of a genre, and second, to determine the key factors that contribute to song popularity.
Solution Overview:
Our current approach partially addresses the problem. Using EDA methods, we are able to generate insights for:
The graphical techniques, in combination with summary statistics, help us understand the space of the data. They give us the intuition that each genre forms its own cluster with unique characteristics, and that with the right model we can capture those characteristics and predict the genre.
Benefits for the consumer of the analysis
The analysis will be useful for Spotify, enabling it to automatically classify songs based on their audio characteristics. Millions of songs are posted on the platform every day, so classifying them into genres automatically is crucial: classified songs can be fed to the recommendation algorithm, which surfaces them in the recommendation lists of users who are most likely to listen to them.
Being able to predict song popularity allows Spotify to release a song in the selective regions and to the user bases where it is likely to be most popular.
The following packages are used:
tidyverse = Loads ggplot2 for graphics and dplyr for data manipulation
data.table = Reading large tables
gridExtra = For simultaneous display of multiple plots
corrplot = Correlation Matrix Visualization
DT = For HTML display of spotify dataset
kableExtra = For additional features on outputting tables
nnet = For Logistic Regression modeling
car = For checking multicollinearity through Variance Inflation Factor
Rtsne = Cluster grouping of audio features using TSNE algorithm
caret = Training models for hyperparameter tuning
class = For KNN modeling
e1071 = For SVM modeling
rpart = For decision tree modeling
rpart.plot = For plotting decision tree graph
randomForest = For Random Forest modeling
xgboost = For XGBoost modeling
doParallel = For parallel processing
kernlab = For hyperparameter tuning of SVM
ranger = For hyperparameter tuning of Random Forest
# loading packages
packages = c('tidyverse', 'data.table', 'gridExtra', 'corrplot', 'DT',
'kableExtra', 'nnet', 'car', 'Rtsne', 'caret', 'class', 'e1071',
'rpart', 'rpart.plot', 'randomForest', 'xgboost', 'doParallel',
'kernlab', 'ranger')
installed_packages = packages %in% rownames(installed.packages())
if (any(installed_packages == F)) {
install.packages(packages[!installed_packages])
}
# suppressing warnings
options(warn = -1)
suppressMessages(invisible(lapply(packages, library, character.only = T)))
Everything you need to know about the data is covered here.
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
A subset of the data has already been extracted and is available on GitHub; the analysis is performed on this subset. The song database consists of songs along with their popularity, artists, and the albums they belong to, across 6 main genre categories (EDM, Latin, Pop, R&B, Rap, & Rock) from Jan 1957 to Jan 2020.
| feature | data_type | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
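The speechiness thresholds quoted in the table can be turned into a quick labelling helper. This is a hedged sketch: the function name and the band labels are our own illustration, not part of the dataset or Spotify's API.

```r
# Hedged sketch: bucket speechiness per the documented 0.33 / 0.66 thresholds.
# The band labels ("music", "mixed", "speech") are illustrative.
speechiness_band = function(x) {
  cut(x,
      breaks = c(-Inf, 0.33, 0.66, Inf),
      labels = c("music", "mixed", "speech"))
}
speechiness_band(c(0.05, 0.45, 0.90))
# [1] music  mixed  speech
```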
URL = paste0("https://raw.githubusercontent.com/",
"rfordatascience/tidytuesday/master/data/",
"2020/2020-01-21/spotify_songs.csv")
spotify_df = fread(URL)
datatable(spotify_df, extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))
The data cleaning process involved the following steps:
Investigating data types
str(spotify_df)
## Classes 'data.table' and 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## - attr(*, ".internal.selfref")=<externalptr>
Investigating missing rows
# checking missing values for each columns
colSums(is.na(spotify_df))
# counting total number of missing rows and removing them
missing_rows = spotify_df[rowSums(is.na(spotify_df)) > 0, ]
spotify_df = spotify_df[complete.cases(spotify_df), ]
Investigating song duration
# removing songs whose duration is an extreme outlier (range = 3 IQRs)
duration_out = boxplot(spotify_df$duration_ms,
plot = F, range = 3)$out
spotify_df = spotify_df[!spotify_df$duration_ms %in% duration_out, ]
nrow(spotify_df[spotify_df$duration_ms < 60000, ])
# keeping only songs longer than one minute
spotify_df = spotify_df[spotify_df$duration_ms > 60000, ]
rm(duration_out)
Removing columns
# removing playlist name as they are names given by the users
# which are highly subjective and add least information
unique(spotify_df$playlist_name)
spotify_df = spotify_df[, -c("playlist_name", "playlist_id")]
Investigating duplicates
# checking duplicates based on track name, artist and release date
duplicate_rows = spotify_df[duplicated(
spotify_df[, c("track_name", "track_artist", "track_album_release_date")])]
duplicate_rows = duplicate_rows[order(track_name), ]
Adding new columns and manipulating existing ones
# returns the mode (most frequent level) of categorical values
# for genre, a tie in the number of occurrences returns all tied levels
# note: this masks base::mode() for the rest of the analysis
mode = function(x, type) {
if(type == 'subgenre') {
cat_values = as.factor(x)
levels(cat_values)[which.max(tabulate(cat_values))]
} else {
cat_values = as.factor(x)
temp = table(x)
paste(levels(cat_values)[which(temp == max(temp))], collapse = ',')
}
}
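As a quick sanity check, the helper behaves as follows on toy vectors (a hedged example; the inputs are invented):

```r
# a clear winner returns a single level
mode(c("pop", "pop", "rock"), "genre")                 # "pop"
# a tie in the genre mode returns all tied levels, comma-separated
mode(c("pop", "rock"), "genre")                        # "pop,rock"
# the subgenre mode always resolves to a single level
mode(c("dance pop", "dance pop", "trap"), "subgenre")  # "dance pop"
```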
# aggregating different genre and subgenre in a single row separated by comma
# adding columns for the mode of genre and subgenre for each track
spotify_df = spotify_df %>%
group_by(track_name, track_artist, track_album_release_date) %>%
mutate(playlist_genre_new = map_chr(
playlist_genre, ~toString(setdiff(playlist_genre, .x))),
playlist_subgenre_new = map_chr(
playlist_subgenre, ~toString(setdiff(playlist_subgenre, .x))),
genre_mode = mode(playlist_genre, "genre"),
subgenre_mode = mode(playlist_subgenre, "subgenre")) %>%
ungroup()
spotify_df = unite(spotify_df, "playlist_genre",
c("playlist_genre", "playlist_genre_new"),
sep = ",")
spotify_df = unite(spotify_df, "playlist_subgenre",
c("playlist_subgenre", "playlist_subgenre_new"),
sep = ",")
spotify_df$playlist_genre = gsub(",$", "", spotify_df$playlist_genre)
spotify_df$playlist_subgenre = gsub(",$", "", spotify_df$playlist_subgenre)
spotify_df = spotify_df[!duplicated(spotify_df[c("track_name", "track_artist",
"track_album_release_date")]), ]
# separating date to year, month and day
# assuming the day to be the 1st of the month where missing
# assuming the month to be Jan where month is missing
spotify_df = separate(spotify_df, col = track_album_release_date,
into = c("year", "month", "day"), sep = "-")
colSums(is.na(spotify_df))
spotify_df[is.na(spotify_df)] = "01"
spotify_df[c("year","month", "day")] = sapply(spotify_df[c("year","month", "day")],
as.integer)
# changing multigenre mode to a single genre based on the euclidean distance
# assigned to the closest the audio features are to the median of concerned genres
audio_features = colnames(spotify_df)[12:23]
single_genre_df = filter(spotify_df, !grepl(",", genre_mode))
multi_genre_df = filter(spotify_df, grepl(",", genre_mode))
median_df = single_genre_df %>%
select(c('genre_mode', all_of(audio_features))) %>%
group_by(genre_mode) %>%
summarise_if(is.numeric, median) %>%
ungroup()
for(i in 1:nrow(multi_genre_df)) {
temp = multi_genre_df[i, c('genre_mode', audio_features)]
multi_genres = strsplit(temp$genre_mode, ",")[[1]]
dist_vector = c()
for(j in 1:length(multi_genres)) {
median_values = filter(median_df, genre_mode == multi_genres[j])
eucli_dist = dist(rbind(temp[, audio_features],
median_values[, audio_features])[, -c(12)])[1] # excluding duration_ms (col 12)
dist_vector = append(dist_vector, eucli_dist)
}
multi_genre_df$genre_mode[i] = multi_genres[which.min(dist_vector)] # which.min avoids errors on ties
}
spotify_df = rbind(single_genre_df, multi_genre_df)
rm(median_df, median_values, multi_genre_df, single_genre_df, temp)
Following is the frequency count of genres.
| Genre | Count |
|---|---|
| edm | 4969 |
| latin | 4170 |
| pop | 4406 |
| r&b | 4590 |
| rap | 5094 |
| rock | 4166 |
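The counts above can be regenerated from the cleaned data frame (a minimal sketch, assuming `spotify_df` and `genre_mode` as defined earlier):

```r
spotify_df %>%
  count(genre_mode, name = "Count") %>%
  arrange(genre_mode)
```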
After data cleaning, the final dataset looks as follows:
This section explores the audio features, song genre, and track popularity.
Distribution of audio features
# checking the distribution of audio features
plot_list = list()
for (i in 1:length(audio_features)) {
plot_list[[i]] = ggplot(spotify_df, aes_string(x = audio_features[i])) +
geom_density(color = "darkblue", fill = "lightblue")
}
do.call(grid.arrange,
c(plot_list, list(top = "Density distribution of audio features")))
Boxplot representation of audio features
# checking for outliers in audio features
plot_list = list()
for (i in 1:length(audio_features)) {
plot_list[[i]] = ggplot(spotify_df, aes_string(y = audio_features[i])) +
geom_boxplot(color = "darkblue", fill = "lightblue", outlier.colour = "red",
outlier.shape = 1, outlier.alpha = 0.5)
}
do.call(grid.arrange,
c(plot_list, list(top = "Boxplot representation of audio features")))
Default percentage of outliers based on boxplot
# getting default percentage of outliers based on boxplot
for (i in 1:length(audio_features)) {
out_percent = length(boxplot(spotify_df[, c(audio_features[i])],
plot = F, range = 1.5)$out) * 100 / nrow(spotify_df)
print(paste0("The outlier percentage for ",
audio_features[i], " is ", round(out_percent, 2), "%"))
}
## [1] "The outlier percentage for danceability is 0.89%"
## [1] "The outlier percentage for energy is 0.77%"
## [1] "The outlier percentage for key is 0%"
## [1] "The outlier percentage for loudness is 2.97%"
## [1] "The outlier percentage for mode is 0%"
## [1] "The outlier percentage for speechiness is 9.56%"
## [1] "The outlier percentage for acousticness is 6.7%"
## [1] "The outlier percentage for instrumentalness is 21.5%"
## [1] "The outlier percentage for liveness is 5.73%"
## [1] "The outlier percentage for valence is 0%"
## [1] "The outlier percentage for tempo is 1.71%"
## [1] "The outlier percentage for duration_ms is 3.64%"
Boxplot representation of audio features based on genre
# checking the range in audio features based on genre
plot_list = list()
for (i in 1:length(audio_features)) {
plot_list[[i]] = ggplot(spotify_df,
aes_string(x = "genre_mode", y = audio_features[i])) +
geom_boxplot(color = "darkblue", fill = "lightblue", outlier.colour = "red",
outlier.shape = 1, outlier.alpha = 0.5) +
xlab("genre")
}
do.call(grid.arrange,
c(plot_list, list(top = "Boxplot representation of audio features for each genre")))
Correlation of audio features
# correlation between audio features
spotify_df %>%
select(all_of(audio_features)) %>%
scale() %>%
cor() %>%
corrplot::corrplot(method = 'color',
order = 'hclust',
type = 'upper',
diag = FALSE,
tl.col = 'black',
addCoef.col = "grey30",
number.cex = 0.6,
main = 'Correlation among audio features',
mar = c(2,2,2,2),
family = 'Avenir')
Audio features over the years
# audio features over the years
# taking mean for each year
year_features_df = spotify_df %>%
select(c('year', all_of(audio_features))) %>%
group_by(year) %>%
summarise_if(is.numeric, mean) %>%
ungroup()
plot_list = list()
for (i in 1:length(audio_features)) {
plot_list[[i]] = ggplot(year_features_df,
aes_string(x = "year", y = audio_features[i])) +
geom_line()
}
do.call(grid.arrange,
c(plot_list, list(top = "Audio features over the years")))
Correlation among genres
# correlation within genre
# getting median values for each genre and finding correlation between them
genre_audio_df = spotify_df %>%
select(c('genre_mode', all_of(audio_features)), -c(mode, key)) %>%
group_by(genre_mode) %>%
summarise_if(is.numeric, median) %>%
ungroup()
genre_audio_df = select(genre_audio_df, -genre_mode)
# scaling the values for better correlation mapping
genre_audio_df = scale(genre_audio_df)
genre_audio_df = t(genre_audio_df)
colnames(genre_audio_df) = sort(unique(spotify_df$genre_mode))
genre_audio_df %>%
cor() %>%
corrplot::corrplot(method = 'color',
order = 'hclust',
type = 'upper',
diag = FALSE,
tl.col = 'black',
addCoef.col = "grey30",
main = 'Correlation among genres',
mar = c(2,2,2,2),
family = 'Avenir',
number.cex=0.85)
Correlation of popularity with audio features
# correlation of popularity with audio features
popularity_features_df = spotify_df %>%
select(c('track_popularity', all_of(audio_features))) %>%
group_by(track_popularity) %>%
summarise_if(is.numeric, mean) %>%
ungroup()
plot_list = list()
for (i in 1:length(audio_features)) {
plot_list[[i]] = ggplot(popularity_features_df, aes_string(x = "track_popularity",
y = audio_features[i])) +
geom_point(shape = 18, color = 4) +
geom_smooth(method = lm, linetype = "dashed", color = "darkred", se = F) +
xlab("popularity")
}
suppressMessages(do.call(grid.arrange,
c(plot_list, list(top = "Correlation of popularity with audio features"))))
Popularity of genres over the years
# popularity of genres over the years
year_genre_features_df = spotify_df %>%
select(c('year', 'genre_mode', 'track_popularity')) %>%
group_by(year, genre_mode) %>%
summarise_if(is.numeric, mean) %>%
ungroup()
plot_list = list()
genres = sort(unique(spotify_df$genre_mode))
for (i in 1:length(genres)) {
temp = filter(year_genre_features_df, genre_mode == genres[i])
if (nrow(temp) > 0) {
plot_list[[i]] = ggplot(temp,
aes_string(x = "year", y = "track_popularity")) +
geom_line() +
ggtitle(paste0("Popularity trend for ", genres[i])) +
ylab("popularity")
}
}
do.call(grid.arrange, plot_list)
Impact of holiday season on any genre
# analysis if holiday season impacts any particular genre
popularity_month_df = spotify_df %>%
select('month', 'genre_mode', 'track_popularity') %>%
group_by(month, genre_mode) %>%
summarise(popularity = mean(track_popularity)) %>%
ungroup()
ggplot(popularity_month_df, aes(x = month, y = popularity)) +
geom_line(aes(color = genre_mode)) + theme_bw() +
ggtitle("Month-wise popularity of genres") +
ylab("popularity") + labs(color = "Genre")
Number of songs released over the years
# trend for number of songs released
song_count_df = spotify_df %>%
select('year') %>%
filter(year <= 2019) %>%
group_by(year) %>%
summarise(songs_released = n()) %>%
ungroup()
ggplot(song_count_df,
aes_string(x = "year", y = "songs_released")) +
geom_line() +
ggtitle("Number of songs released over the years") +
ylab("songs released")
Number of songs released for each genre in the last 10 years
# number of songs released for each genre in the last 10 years
song_count_df = spotify_df %>%
select('year', 'genre_mode') %>%
filter(year > 2009 & year <= 2019) %>%
group_by(year, genre_mode) %>%
summarise(songs_released = n()) %>%
ungroup()
ggplot(song_count_df, aes(x = year, y = songs_released)) +
geom_line(aes(color = genre_mode)) + theme_bw() +
ggtitle("Number of songs released in the last 10 years for each genre") +
ylab("songs released") + labs(color = "Genre")
This is the modeling section, where we try to classify the genre based on the audio features.
Data Filtering
# filtering out data before 1970 due to different spikes as observed in the EDA
spotify_df = filter(spotify_df, (year > 1970))
summary(spotify_df[, audio_features])
## danceability energy key loudness
## Min. :0.0771 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5620 1st Qu.:0.579000 1st Qu.: 2.000 1st Qu.: -8.264
## Median :0.6710 Median :0.722000 Median : 6.000 Median : -6.237
## Mean :0.6544 Mean :0.698464 Mean : 5.373 Mean : -6.790
## 3rd Qu.:0.7600 3rd Qu.:0.843000 3rd Qu.: 9.000 3rd Qu.: -4.699
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0224 Min. :0.0000014 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0411 1st Qu.:0.0144000 1st Qu.:0.0000000
## Median :1.0000 Median :0.0630 Median :0.0798000 Median :0.0000195
## Mean :0.5622 Mean :0.1086 Mean :0.1775430 Mean :0.0901270
## 3rd Qu.:1.0000 3rd Qu.:0.1350 3rd Qu.:0.2600000 3rd Qu.:0.0061300
## Max. :1.0000 Max. :0.9180 Max. :0.9920000 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.00936 Min. :0.00001 Min. : 35.48 Min. : 60447
## 1st Qu.:0.09280 1st Qu.:0.32800 1st Qu.: 99.97 1st Qu.:187492
## Median :0.12700 Median :0.51100 Median :121.99 Median :216240
## Mean :0.19108 Mean :0.50953 Mean :120.93 Mean :224521
## 3rd Qu.:0.24900 3rd Qu.:0.69300 3rd Qu.:134.00 3rd Qu.:253600
## Max. :0.99600 Max. :0.99100 Max. :239.44 Max. :450907
Feature Selection and Standardization
# taking only audio features for genre classification
spotify_df = spotify_df %>%
select(c('genre_mode'), all_of(audio_features))
colnames(spotify_df)[1] = "genre"
# scaling audio features
spotify_df = spotify_df %>%
mutate_if(is.numeric, scale)
Multicollinearity sanity-check
logistic_model = multinom(genre ~., data = spotify_df)
# checking for multicollinearity
vif(logistic_model)
## danceability energy key loudness
## 2.523998 4.692985 1.847775 4.208273
## mode speechiness acousticness instrumentalness
## 1.825805 2.340076 3.927155 1.478525
## liveness valence tempo duration_ms
## 1.633940 2.501153 2.104063 1.801746
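As a hedged follow-up, features can be flagged against the common rule-of-thumb cutoff of 5 (the cutoff is a convention, not a hard rule; this assumes `vif()` returns the named vector printed above):

```r
vif_values = vif(logistic_model)
names(vif_values)[vif_values > 5]
# character(0) -- no feature crosses the cutoff, so all are retained
```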
t-SNE cluster analysis
# observing inter-distance of genres and intra-distance within genres
# TSNE algorithm shows very poor clustering
temp = spotify_df %>%
mutate(ID = row_number())
tsne_fit = temp %>%
select('ID', all_of(audio_features)) %>%
column_to_rownames("ID") %>%
Rtsne(check_duplicates = F)
tsne_df = tsne_fit$Y %>%
as.data.frame() %>%
rename(tSNE1 = "V1", tSNE2 = "V2") %>%
mutate(ID = row_number())
tsne_df = tsne_df %>%
inner_join(temp, by = "ID")
tsne_df %>%
ggplot(aes(x = tSNE1,
y = tSNE2,
color = genre)) +
geom_point() +
theme(legend.position = "bottom")
Train-Test split
# train-test split
set.seed(42)  # arbitrary seed for reproducibility
index = createDataPartition(spotify_df$genre, p = 0.7, list = F)
train_df = spotify_df[index,]
test_df = spotify_df[-index,]
Logistic Regression
# Logistic Regression
logistic_model = multinom(genre ~., data = train_df)
predicted_genre = predict(logistic_model, newdata = train_df[-1])
confusionMatrix(data = predicted_genre,
reference = as.factor(train_df$genre))$overall[1]
## Accuracy
## 0.4731843
predicted_genre = predict(logistic_model, test_df[-1])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 894 171 241 63 174 115
## latin 115 442 164 159 167 54
## pop 189 134 331 133 85 86
## r&b 63 185 178 581 221 143
## rap 130 201 123 249 823 26
## rock 99 117 284 182 57 782
##
## Overall Statistics
##
## Accuracy : 0.4721
## 95% CI : (0.4612, 0.483)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3655
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.6000 0.35360 0.25057 0.42502 0.5390
## Specificity 0.8855 0.90464 0.90833 0.88372 0.8901
## Pos Pred Value 0.5392 0.40145 0.34551 0.42378 0.5303
## Neg Pred Value 0.9083 0.88555 0.86256 0.88424 0.8935
## Prevalence 0.1826 0.15317 0.16187 0.16750 0.1871
## Detection Rate 0.1095 0.05416 0.04056 0.07119 0.1008
## Detection Prevalence 0.2032 0.13491 0.11739 0.16799 0.1902
## Balanced Accuracy 0.7427 0.62912 0.57945 0.65437 0.7145
## Class: rock
## Sensitivity 0.64842
## Specificity 0.89375
## Pos Pred Value 0.51414
## Neg Pred Value 0.93614
## Prevalence 0.14778
## Detection Rate 0.09582
## Detection Prevalence 0.18637
## Balanced Accuracy 0.77109
# calculating p-value
z = summary(logistic_model)$coefficients / summary(logistic_model)$standard.errors
p = (1 - pnorm(abs(z), 0, 1)) * 2
print(p)
## (Intercept) danceability energy key loudness mode
## latin 8.659740e-15 0.55939680 0.0000000 0.3701239 1.458917e-05 2.863748e-05
## pop 0.000000e+00 0.00000000 0.0000000 0.4889659 1.318093e-03 1.504019e-06
## r&b 1.629307e-05 0.00000000 0.0000000 0.6680378 1.283080e-09 3.233572e-01
## rap 3.390621e-13 0.01329863 0.0000000 0.8392161 2.174205e-04 8.476298e-02
## rock 0.000000e+00 0.00000000 0.8517028 0.5638046 0.000000e+00 0.000000e+00
## speechiness acousticness instrumentalness liveness valence
## latin 0.00267331 0.000000e+00 0 2.752921e-06 0
## pop 0.00000000 3.396404e-09 0 5.052625e-12 0
## r&b 0.00000000 0.000000e+00 0 8.066808e-06 0
## rap 0.00000000 1.669294e-08 0 5.642404e-02 0
## rock 0.00000000 2.091391e-04 0 1.831729e-07 0
## tempo duration_ms
## latin 0.000000e+00 0.05473688
## pop 2.509104e-14 0.19080674
## r&b 0.000000e+00 0.00000000
## rap 5.280221e-13 0.42707376
## rock 0.000000e+00 0.00000000
# Logistic Regression without key, mode and duration_ms
logistic_model = multinom(genre ~., data = train_df[-c(4, 6, 13)])
predicted_genre = predict(logistic_model, test_df[-c(1, 4, 6, 13)])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 896 177 255 63 175 119
## latin 127 438 153 189 175 58
## pop 173 120 301 153 83 97
## r&b 54 195 223 518 195 138
## rap 122 204 107 267 841 23
## rock 118 116 282 177 58 771
##
## Overall Statistics
##
## Accuracy : 0.4613
## 95% CI : (0.4505, 0.4722)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3525
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.6013 0.35040 0.22786 0.37893 0.5508
## Specificity 0.8817 0.89842 0.90848 0.88151 0.8910
## Pos Pred Value 0.5318 0.38421 0.32470 0.39153 0.5377
## Neg Pred Value 0.9083 0.88435 0.85900 0.87584 0.8960
## Prevalence 0.1826 0.15317 0.16187 0.16750 0.1871
## Detection Rate 0.1098 0.05367 0.03688 0.06347 0.1031
## Detection Prevalence 0.2065 0.13969 0.11359 0.16211 0.1916
## Balanced Accuracy 0.7415 0.62441 0.56817 0.63022 0.7209
## Class: rock
## Sensitivity 0.63930
## Specificity 0.89202
## Pos Pred Value 0.50657
## Neg Pred Value 0.93448
## Prevalence 0.14778
## Detection Rate 0.09447
## Detection Prevalence 0.18650
## Balanced Accuracy 0.76566
K-NN
predicted_genre = knn(train = train_df[, -1],
test = test_df[, -1],
cl = train_df$genre,
k = 5)
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 906 152 249 74 148 103
## latin 117 455 204 184 190 76
## pop 223 190 376 198 118 174
## r&b 57 182 180 515 241 153
## rap 92 182 101 252 778 36
## rock 95 89 211 144 52 664
##
## Overall Statistics
##
## Accuracy : 0.4526
## 95% CI : (0.4418, 0.4635)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3422
##
## Mcnemar's Test P-Value : 0.005594
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.6081 0.36400 0.28463 0.37674 0.50950
## Specificity 0.8912 0.88844 0.86798 0.88034 0.90006
## Pos Pred Value 0.5551 0.37113 0.29398 0.38780 0.53990
## Neg Pred Value 0.9106 0.88536 0.86269 0.87531 0.88854
## Prevalence 0.1826 0.15317 0.16187 0.16750 0.18711
## Detection Rate 0.1110 0.05575 0.04607 0.06311 0.09533
## Detection Prevalence 0.2000 0.15023 0.15672 0.16273 0.17657
## Balanced Accuracy 0.7496 0.62622 0.57631 0.62854 0.70478
## Class: rock
## Sensitivity 0.55058
## Specificity 0.91503
## Pos Pred Value 0.52908
## Neg Pred Value 0.92152
## Prevalence 0.14778
## Detection Rate 0.08136
## Detection Prevalence 0.15378
## Balanced Accuracy 0.73280
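Since k = 5 was chosen manually, a caret cross-validation sweep could be used to tune it. This is a hedged sketch: the fold count and k grid are illustrative, not values used in the analysis.

```r
# Hedged sketch: 5-fold CV over an illustrative grid of odd k values
ctrl = trainControl(method = "cv", number = 5)
knn_tuned = train(x = train_df[, -1],
                  y = as.factor(train_df$genre),
                  method = "knn",
                  trControl = ctrl,
                  tuneGrid = expand.grid(k = seq(3, 15, 2)))
knn_tuned$bestTune
```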
SVM
svm_model = svm(as.factor(genre) ~ ., data = train_df, kernel = "radial")
predicted_genre = predict(svm_model, newdata = train_df[-1])
confusionMatrix(data = predicted_genre,
reference = as.factor(train_df$genre))$overall[1]
## Accuracy
## 0.5733627
# note: predict.svm takes no "type" argument for class prediction
predicted_genre = predict(svm_model, test_df[-1])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 1000 126 220 53 112 65
## latin 71 457 117 125 114 35
## pop 191 190 453 142 76 128
## r&b 49 169 184 647 161 141
## rap 109 240 126 276 1022 23
## rock 70 68 221 124 42 814
##
## Overall Statistics
##
## Accuracy : 0.5383
## 95% CI : (0.5274, 0.5492)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4444
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.6711 0.3656 0.34292 0.47330 0.6693
## Specificity 0.9137 0.9332 0.89371 0.89638 0.8833
## Pos Pred Value 0.6345 0.4973 0.38390 0.47890 0.5690
## Neg Pred Value 0.9256 0.8905 0.87566 0.89427 0.9207
## Prevalence 0.1826 0.1532 0.16187 0.16750 0.1871
## Detection Rate 0.1225 0.0560 0.05551 0.07928 0.1252
## Detection Prevalence 0.1931 0.1126 0.14459 0.16554 0.2201
## Balanced Accuracy 0.7924 0.6494 0.61832 0.68484 0.7763
## Class: rock
## Sensitivity 0.67496
## Specificity 0.92451
## Pos Pred Value 0.60792
## Neg Pred Value 0.94254
## Prevalence 0.14778
## Detection Rate 0.09974
## Detection Prevalence 0.16407
## Balanced Accuracy 0.79974
Decision Tree
dt_model = rpart(genre ~ ., data = train_df)
predicted_genre = predict(dt_model, newdata = train_df[-1])
predicted_genre = colnames(predicted_genre)[apply(predicted_genre, 1, which.max)]
confusionMatrix(data = as.factor(predicted_genre),
reference = as.factor(train_df$genre))$overall[1]
## Accuracy
## 0.4109992
predicted_genre = predict(dt_model, newdata = test_df[-1], type = "class")
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 629 125 184 52 45 60
## latin 61 316 181 135 126 33
## pop 85 154 224 117 73 95
## r&b 50 120 96 303 78 136
## rap 401 436 328 529 1103 204
## rock 264 99 308 231 102 678
##
## Overall Statistics
##
## Accuracy : 0.3986
## 95% CI : (0.388, 0.4093)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2749
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.42215 0.25280 0.16957 0.22165 0.7223
## Specificity 0.93015 0.92244 0.92339 0.92935 0.7139
## Pos Pred Value 0.57443 0.37089 0.29947 0.38697 0.3675
## Neg Pred Value 0.87815 0.87221 0.85202 0.85579 0.9178
## Prevalence 0.18258 0.15317 0.16187 0.16750 0.1871
## Detection Rate 0.07707 0.03872 0.02745 0.03713 0.1352
## Detection Prevalence 0.13417 0.10440 0.09166 0.09594 0.3677
## Balanced Accuracy 0.67615 0.58762 0.54648 0.57550 0.7181
## Class: rock
## Sensitivity 0.56219
## Specificity 0.85564
## Pos Pred Value 0.40309
## Neg Pred Value 0.91851
## Prevalence 0.14778
## Detection Rate 0.08308
## Detection Prevalence 0.20610
## Balanced Accuracy 0.70892
rpart.plot(dt_model,
type = 5,
extra = 104,
box.palette = list(purple = "#490B32",
red = "#9A031E",
orange = '#FB8B24',
dark_blue = "#0F4C5C",
blue = "#5DA9E9",
grey = '#66717E'),
leaf.round = 0,
fallen.leaves = FALSE,
branch = 0.3,
under = TRUE,
under.col = 'grey40',
family = 'Avenir',
main = 'Genre Decision Tree',
tweak = 1.2)
Random Forest
rf_model = randomForest(as.factor(genre) ~ ., data = train_df,
ntree = 500, importance = T)
predicted_genre = predict(rf_model, newdata = train_df[-1])
confusionMatrix(data = predicted_genre,
reference = as.factor(train_df$genre))$overall[1]
## Accuracy
## 0.9962217
predicted_genre = predict(rf_model, newdata = test_df[-1])
confusionMatrix(data = predicted_genre, reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 1059 109 215 37 91 43
## latin 69 517 133 106 110 33
## pop 188 171 450 133 57 123
## r&b 46 173 187 668 163 119
## rap 81 215 117 293 1060 23
## rock 47 65 219 130 46 865
##
## Overall Statistics
##
## Accuracy : 0.566
## 95% CI : (0.5551, 0.5768)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4778
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.7107 0.41360 0.34065 0.48866 0.6942
## Specificity 0.9258 0.93474 0.90175 0.89873 0.8901
## Pos Pred Value 0.6815 0.53409 0.40107 0.49263 0.5925
## Neg Pred Value 0.9348 0.89810 0.87626 0.89728 0.9267
## Prevalence 0.1826 0.15317 0.16187 0.16750 0.1871
## Detection Rate 0.1298 0.06335 0.05514 0.08185 0.1299
## Detection Prevalence 0.1904 0.11861 0.13748 0.16616 0.2192
## Balanced Accuracy 0.8183 0.67417 0.62120 0.69370 0.7921
## Class: rock
## Sensitivity 0.7172
## Specificity 0.9271
## Pos Pred Value 0.6305
## Neg Pred Value 0.9498
## Prevalence 0.1478
## Detection Rate 0.1060
## Detection Prevalence 0.1681
## Balanced Accuracy 0.8222
XGBoost
# xgboost's multi:softmax expects 0-indexed labels; as.integer(as.factor())
# yields 1..6, so num_class is padded to 6 + 1 (class 0 is never used)
xgb_model = xgboost(data = as.matrix(train_df[-1]),
label = as.integer(as.factor(train_df$genre)),
nrounds = 25,
verbose = FALSE,
params = list(objective = "multi:softmax",
num_class = 6 + 1))
predicted_genre = predict(xgb_model, newdata = as.matrix(train_df[-1]))
predicted_genre = levels(as.factor(train_df$genre))[predicted_genre]
confusionMatrix(data = as.factor(predicted_genre),
reference = as.factor(train_df$genre))$overall[1]
## Accuracy
## 0.6870802
predicted_genre = predict(xgb_model, newdata = as.matrix(test_df[-1]))
predicted_genre = levels(as.factor(test_df$genre))[predicted_genre]
confusionMatrix(data = as.factor(predicted_genre),
reference = as.factor(test_df$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 1056 111 196 38 76 44
## latin 62 501 124 135 122 36
## pop 201 212 467 133 89 136
## r&b 55 166 182 644 179 127
## rap 72 198 129 286 1018 31
## rock 44 62 223 131 43 832
##
## Overall Statistics
##
## Accuracy : 0.5536
## 95% CI : (0.5427, 0.5644)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.463
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.7087 0.40080 0.35352 0.47110 0.6667
## Specificity 0.9303 0.93069 0.88728 0.89564 0.8921
## Pos Pred Value 0.6943 0.51122 0.37722 0.47598 0.5871
## Neg Pred Value 0.9346 0.89570 0.87664 0.89380 0.9208
## Prevalence 0.1826 0.15317 0.16187 0.16750 0.1871
## Detection Rate 0.1294 0.06139 0.05722 0.07891 0.1247
## Detection Prevalence 0.1864 0.12008 0.15170 0.16579 0.2125
## Balanced Accuracy 0.8195 0.66575 0.62040 0.68337 0.7794
## Class: rock
## Sensitivity 0.6899
## Specificity 0.9277
## Pos Pred Value 0.6232
## Neg Pred Value 0.9452
## Prevalence 0.1478
## Detection Rate 0.1019
## Detection Prevalence 0.1636
## Balanced Accuracy 0.8088
Feature Importance of Decision Tree, Random Forest and XGBoost
# comparing feature importance between Decision Tree, Random Forest and XGBoost
importance_dt = data.frame(importance = dt_model$variable.importance)
importance_dt$feature = row.names(importance_dt)
importance_rf = data.frame(importance = randomForest::importance(rf_model, type = 2))
importance_rf$feature = row.names(importance_rf)
importance_xgb = xgb.importance(model = xgb_model)
compare_importance = importance_xgb %>%
select(Feature, Gain) %>%
left_join(importance_dt, by = c('Feature' = 'feature')) %>%
left_join(importance_rf, by = c('Feature' = 'feature')) %>%
rename('xgboost' = 'Gain',
'decision_tree' = 'importance',
'random_forest' = 'MeanDecreaseGini')
compare_importance = compare_importance %>%
mutate_if(is.numeric, ~ as.numeric(scale(.))) %>% # scale() returns a matrix column; flatten to numeric
pivot_longer(cols = c('xgboost', 'decision_tree', 'random_forest')) %>%
rename('model' = 'name')
ggplot(compare_importance, aes(x = reorder(Feature, value, na.rm = T), y = value, color = model)) +
geom_point(size = 2) +
coord_flip() +
labs(title = 'Variable Importance by Model',
y = 'Scaled value', x = '')
Note: hyperparameter tuning took about an hour to run even after parallelizing, so the code below has been commented out; uncomment it to rerun the tuning. Random Search was used instead of Grid Search to keep it computationally tractable, which means the resulting hyperparameters will most likely differ between runs.
Creating parallel processing
# cl = makePSOCKcluster(6)
# registerDoParallel(cl)
# fitControl = trainControl(search = 'random', method = "repeatedcv",
# number = 5, repeats = 3, allowParallel = T)
SVM
# svm_fit = train(genre ~ ., data = train_df,
# method = "svmRadial",
# trControl = fitControl,
# verbose = F,
# tuneLength = 2)
# print(svm_fit)
Random Forest
# rf_fit = train(genre ~ ., data = train_df,
# method = "ranger",
# trControl = fitControl,
# verbose = F,
# tuneLength = 5)
# print(rf_fit)
XGBoost
# xgb_fit = train(genre ~ ., data = train_df,
# method = "xgbTree",
# trControl = fitControl,
# verbose = F,
# tuneLength = 10)
# print(xgb_fit)
# stopCluster(cl)
Hyperparameters returned by Random Search in one of the iterations:
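For reference, the tuned values reused in the stacking section below (reported here under caret's parameter names) can be collected in one place; a fresh Random Search run will produce different values:

```r
# Hyperparameters from one Random Search run, as reported by caret
# (caret's svmRadial "sigma"/"C" correspond to e1071's "gamma"/"cost")
tuned_params = list(
  svm = list(sigma = 0.01, C = 18.8),
  rf  = list(ntree = 500, mtry = 6, min.node.size = 18),
  xgb = list(nrounds = 910, max_depth = 5, eta = 0.012, gamma = 3.8,
             colsample_bytree = 0.5, min_child_weight = 14, subsample = 0.85)
)
```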
Stacking
Using the hyperparameters found above, we feed the predictions of the tuned SVM, Random Forest and XGBoost models into a multinomial logistic regression meta-learner (stacking).
# caret's svmRadial reports "sigma" and "C"; e1071::svm takes the equivalent
# "gamma" and "cost" arguments (unrecognized arguments are silently ignored)
svm_model = svm(as.factor(genre) ~ .,
data = train_df,
kernel = 'radial',
gamma = 0.01,
cost = 18.8)
svm_pred = predict(svm_model, newdata = train_df[-1])
svm_pred_test = predict(svm_model, newdata = test_df[-1])
# ranger's "min.node.size" corresponds to randomForest's "nodesize"
rf_model = randomForest(as.factor(genre) ~ .,
data = train_df,
ntree = 500,
importance = T,
mtry = 6,
nodesize = 18)
rf_pred = predict(rf_model, newdata = train_df[-1])
rf_pred_test = predict(rf_model, newdata = test_df[-1])
xgb_model = xgboost(data = as.matrix(train_df[-1]),
label = as.integer(as.factor(train_df$genre)),
nrounds = 910,
max_depth = 5,
eta = 0.012,
gamma = 3.8,
colsample_bytree = 0.5,
min_child_weight = 14,
subsample = 0.85,
verbose = FALSE,
params = list(objective = "multi:softmax",
num_class = 6 + 1))
xgb_pred = predict(xgb_model, newdata = as.matrix(train_df[-1]))
xgb_pred = levels(as.factor(train_df$genre))[xgb_pred]
xgb_pred_test = predict(xgb_model, newdata = as.matrix(test_df[-1]))
xgb_pred_test = levels(as.factor(test_df$genre))[xgb_pred_test]
stacked_df = data.frame(svm = svm_pred,
rf = rf_pred,
xgb = xgb_pred,
genre = train_df[1])
stacked_df_test = data.frame(svm = svm_pred_test,
rf = rf_pred_test,
xgb = xgb_pred_test,
genre = test_df[1])
logistic_model = multinom(genre ~ ., data = stacked_df)
predicted_genre = predict(logistic_model, newdata = stacked_df[-4])
confusionMatrix(data = as.factor(predicted_genre),
reference = as.factor(stacked_df$genre))$overall[1]
## Accuracy
## 0.9962741
predicted_genre = predict(logistic_model, stacked_df_test[-4])
confusionMatrix(data = predicted_genre,
reference = as.factor(stacked_df_test$genre))
## Confusion Matrix and Statistics
##
## Reference
## Prediction edm latin pop r&b rap rock
## edm 1041 108 208 31 86 39
## latin 70 518 144 116 106 38
## pop 204 177 447 143 69 124
## r&b 39 166 187 668 169 121
## rap 84 214 123 289 1056 24
## rock 52 67 212 120 41 860
##
## Overall Statistics
##
## Accuracy : 0.5624
## 95% CI : (0.5516, 0.5732)
## No Information Rate : 0.1871
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4736
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: edm Class: latin Class: pop Class: r&b Class: rap
## Sensitivity 0.6987 0.41440 0.33838 0.48866 0.6916
## Specificity 0.9292 0.93141 0.89518 0.89962 0.8894
## Pos Pred Value 0.6880 0.52218 0.38402 0.49481 0.5899
## Neg Pred Value 0.9325 0.89789 0.87509 0.89737 0.9261
## Prevalence 0.1826 0.15317 0.16187 0.16750 0.1871
## Detection Rate 0.1276 0.06347 0.05477 0.08185 0.1294
## Detection Prevalence 0.1854 0.12155 0.14263 0.16542 0.2193
## Balanced Accuracy 0.8140 0.67291 0.61678 0.69414 0.7905
## Class: rock
## Sensitivity 0.7131
## Specificity 0.9293
## Pos Pred Value 0.6361
## Neg Pred Value 0.9492
## Prevalence 0.1478
## Detection Rate 0.1054
## Detection Prevalence 0.1657
## Balanced Accuracy 0.8212
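Before drawing conclusions, the test-set accuracies reported in the confusion matrices above can be tabulated for a side-by-side comparison:

```r
# Test-set accuracy by model, copied from the confusion matrices above
acc_summary = data.frame(
  model = c("SVM", "Decision Tree", "Random Forest", "XGBoost", "Stacked"),
  accuracy = c(0.5383, 0.3986, 0.5660, 0.5536, 0.5624)
)
acc_summary[order(-acc_summary$accuracy), ]
```

Random Forest edges out the stacked ensemble here, suggesting the base learners make largely correlated errors.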
Genre
The genre classification model achieved a test accuracy of 57%. While far from perfect, this is substantially better than a random guess ( \(1/6\) ), and the exercise gave considerable insight into the data space and what could be improved.
speechiness, danceability and tempo were the main features that helped identify a few genres. Rap ( \(68\)% accuracy ) was associated with high speechiness, Rock ( \(68\)% accuracy ) could be explained by low danceability, and EDM ( \(72\)% accuracy ) was linked with high tempo. Latin ( \(40\)% accuracy ), R&B ( \(47\)% accuracy ) and Pop ( \(36\)% accuracy ) were the hardest genres to classify, though Latin was somewhat differentiated by high danceability and R&B by a relatively long duration_ms. The models had a hard time recognizing Pop as a genre because nearly every audio feature spans a wide range for it, and hence it had the lowest accuracy, sensitivity and specificity.
Popularity